Abstract

We introduce an algorithm that, given n objects,
learns a similarity matrix over all $n^2$
pairs, from crowdsourced data alone. The al- 
gorithm samples responses to adaptively cho- 
sen triplet-based relative-similarity queries. 
Each query has the form "is object a more 
similar to b or to c?" and is chosen to be
maximally informative given the preceding 
responses. The output is an embedding of 
the objects into Euclidean space (like MDS); 
we refer to this as the "crowd kernel." SVMs 
reveal that the crowd kernel captures promi- 
nent and subtle features across a number of 
domains, such as "is striped" among neckties 
and "vowel vs. consonant" among letters. 



1. Introduction 

Essential to the success of machine learning on a new 
domain is determining a good "similarity function" be- 
tween objects (or alternatively defining good object 
"features"). With such a "kernel," one can perform 
a number of interesting tasks, e.g. binary classifica- 
tion using Support Vector Machines, clustering, inter- 
active database search, or any of a number of other 
off-the-shelf kernelized applications. Since this step of 
determining a kernel is most often the step that is still 
not routinized, effective systems for achieving this step 
are desirable as they hold the potential for completely
removing the machine learning researcher from "the 
loop." Such systems could allow practitioners with no 
machine learning expertise to employ learning on their 
domain. In many domains, people have a good sense 
of what similarity is, and in these cases the similarity 
function may be determined based upon crowdsourced 
human responses alone. 

The problem of capturing and extrapolating a human 
notion of perceptual similarity has received increasing 
attention in recent years, including areas such as vision
(Agarwal et al., 2007), audition (McFee & Lanckriet,
2009), information retrieval (Schultz & Joachims,
2003) and a variety of others represented in the UCI 
Datasets (Xing et al., 2003; Huang et al., 2010). Con- 
cretely, the goal of these approaches is to estimate a 
similarity matrix K over all pairs of n objects given a 
(potentially exhaustive) subset of human perceptual 
measurements on tuples of objects. In some cases 
the set of human measurements represents 'side infor- 
mation' to computed descriptors (MFCC, SIFT, etc.), 
while in other cases - the present work included - one 
proceeds exclusively with human reported data. When 
K is a positive semidefinite matrix induced purely
from distributed human measurements, we refer to it 
as the crowd kernel for the set of objects. 

Given such a Kernel, one can exploit it for a vari- 
ety of purposes including exploratory data analysis or 
embedding visualization (as in Multidimensional Scal- 
ing) and relevance-feedback based interactive search. 
As discussed in the above works and (Kendall & Gibbons,
1990), using a triplet-based representation of relative
similarity, in which a subject is asked "is object



Adaptively Learning the Crowd Kernel 




Figure 1. A sample top-level of a similarity search system 
that enables a user to search for objects by similarity. In 
this case, since the user clicked on the middle-left tile, she 
will "zoom-in" and be presented with similar tiles. 



a more similar to b or to c," has a number of desir- 
able properties over the classical approach employed 
in Multi-Dimensional Scaling (MDS), i.e., asking for a 
numerical estimate of "how similar is object a to b."
These advantages include reducing fatigue on human 
subjects and alleviating the need to reconcile individu- 
als' scales of similarity. The obvious drawback with the 
triplet-based method, however, is the potential $O(n^3)$
complexity. It is therefore expedient to seek methods 
of obtaining high quality approximations of K from 
as small a subset of human measurements as possible. 
Accordingly, the primary contribution of this paper is 
an efficient method for estimating K via an informa- 
tion theoretic adaptive sampling approach. 

At the heart of our approach is a new scale-invariant 
Kernel approximation model. The choice of model is 
shown to be crucial in terms of the adaptive triples 
that are produced, and the new model produces effec- 
tive triples to label. Although this model is noncon- 
vex, we prove that it can be optimized under certain 
assumptions. 

We construct an end-to-end system for interactive vi- 
sual search and browsing using our Kernel acquisition 
algorithm. The input to this system is a set of im- 
ages of objects, such as products available in an online 
store. The system automatically crowdsources^1 the

^1 Crowdsourcing was done on Amazon's Mechanical
Turk, http://mturk.com.



kernel acquisition and then uses this kernel to produce 
a visual interface for searching or browsing the set of 
products. Figure 1 shows this interface for a dataset 
of 433 floor tiles available at amazon.com. 

1.1. Human kernels versus machine kernels 

The bulk of work in Machine Learning focuses on "Ma- 
chine Kernels" that are computed by computer from 
the raw data (e.g., pixels) themselves. Additional work 
employs human experiments to try to learn kernels 
based upon machine features, i.e., to approximate the 
human similarity assessments based upon features that 
can be derived by machine. In contrast, when a ker- 
nel is learned from human subjects alone (whether it 
be data from an individual or a crowd) one requires 
no machine features whatsoever. To the computer, 
the objects are recognized by ID's only - the images 
themselves are hidden from our system and are only 
presented to humans. 

The primary advantage of machine kernels is that they 
can generalize immediately to new data, whereas each 
additional object needs to be added to our system, for 
a cost of approximately $0.15.^2 On the other hand,
working with a human kernel has two primary advan- 
tages. First, it does not require any domain expertise. 
While for any particular domain, such as music or im- 
ages of faces, cars, or sofas, decades of research may 
have provided high-quality features, one does not have 
to find, implement, and tune these sophisticated feature
detectors.

Second, human kernels contain features that are sim- 
ply not available with state-of-the-art feature detec- 
tors, because of knowledge and experience that hu- 
mans possess. For example, from images of celebri- 
ties, human similarity may be partly based on whether 
the two celebrities are both from the same profession, 
such as politicians, actors, and so forth. Until the long- 
standing goal of bridging the semantic gap is achieved, 
humans will be far better than machines at interpret- 
ing certain features, such as "does a couch look com- 
fortable," "can a shoe be worn to an informal occa- 
sion," or "is a joke funny." 

We give a simple demonstration of external knowledge 
through experiments on 26 images of the lower-case 
Roman alphabet. Here, the learned Kernel is shown 
to capture features such as "is a letter a vowel versus 
consonant," which uses external knowledge beyond the 
pixels. Note that this experiment is interesting in itself 

^2 This price was empirically observed to yield "good
performance" across a number of domains. See the experimental
results section for evaluation criteria.
because it is not at first clear if people can meaningfully
answer the question: "is the letter e more similar
to i or p." Our experiments show statistically significant
consistency with 58%^3 (±2%, with 95% confidence)
agreement between users on a random triple of
letters. (For random image triples from an online tie 
store, 68% agreement is observed, and 65% is observed 
for floor tile images). 

2. Benefits of adaptation 

We first give high-level intuition for why adaptively 
choosing triples may yield better kernel approxima- 
tions than randomly chosen triples. Consider n objects 
organized in a rooted tree with $\ell \ll n$ leaves, inspired
by, say, phylogenetic trees involving animal species.^4 Say
the similarity between objects is decreasing in their 
distance in the tree graph and, furthermore, that ob- 
jects are drawn uniformly at random from the classes 
represented by the leaves of the tree. Ignoring the de- 
tails of how one would identify that two objects are in 
the same leaf or subtree, it is clear that a nonadaptive 
method would have to ask $\Omega(n\ell)$ questions to determine
the leaves to which n objects belong (or at least
to determine which objects are in the same leaves), because
an expected $\Omega(\ell)$ queries are required per object
until just a second object is chosen from the same leaf.
On the other hand, in an ideal setting, an adaptive approach
might determine such matters using $O(n \log \ell)$
queries in a balanced binary tree, proceeding from the 
root down, assuming a constant number of compar- 
isons can determine to which subtree of a node an 
object belongs, hence an exponential savings. 
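
To make the gap concrete, the following back-of-the-envelope sketch compares the two query counts while ignoring all constant factors. The function names and the sizes n = 1000 and 64 leaves are illustrative assumptions, not figures from our experiments.

```python
import math

def nonadaptive_queries(n, leaves):
    # Order of n * l: roughly l random comparisons per object before it is
    # paired with a second object from the same leaf (constants ignored).
    return n * leaves

def adaptive_queries(n, leaves):
    # Order of n * log2(l): one root-to-leaf descent per object in a
    # balanced binary tree, assuming O(1) comparisons per node.
    return n * math.log2(leaves)

n, leaves = 1000, 64  # hypothetical sizes
print(nonadaptive_queries(n, leaves))  # 64000
print(adaptive_queries(n, leaves))     # 6000.0
```

Under these assumptions the adaptive count is smaller by a factor of about ten, and the ratio grows as the number of leaves increases.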

3. Related work 

As discussed above, much of the work in machine 
learning on learning kernels employs 'side informa- 
tion' in the form of features about objects. Schultz 
& Joachims (2003) highlight the fact that triple-based 
information may also be gathered by web search click 
data. Agarwal et al. (2007) is probably the most sim- 
ilar work, in which they learn a kernel matrix from 
triples of similarity comparisons, as we do. However, 
the triples they consider are randomly (nonadaptively) 
chosen. Their particular fitting algorithm differs in 

^3 While this fraction of agreement seems small, it corresponds
to about 25% "noise," e.g., if 75% of people would
say that a is more like b than c, then two random people
would both give that answer with probability $0.75^2 \approx 0.56$.

^4 This example is based upon a tree metric rather than
a Euclidean one. However, note that any tree with $\ell$ leaves
can be embedded in $\ell$-dimensional Euclidean space, where
squared Euclidean distance equals tree distance.



that it is based on a max-margin approach, which is 
more common in the kernel learning literature. 

There is a wealth of work in active learning for classifi- 
cation, where a learner selects examples from a pool of 
unlabeled examples to label. A number of approaches 
have been employed, and our work is in the same 
spirit as those that employ probabilistic models and 
information-theoretic measures to maximize informa- 
tion. Other work often labels examples based on those 
that are closest to the margin or closest to 50% prob- 
ability of being positive or negative. To see why this 
latter approach may be problematic in our setting, one 
could imagine a set of triples where we have accurately 
learned that the answer is 50/50, e.g., as may be the 
case if a, b, and c bear no relation to each other or if 
they are identical. One may not want to focus on such 
triples. 

4. Preliminaries 

The set of n objects is denoted by $[n] = \{1, 2, \ldots, n\}$.
For $a, b, c \in [n]$, a comparison or triple is of the form,
"is a more similar to b or to c." We refer to a as the
head of the triple. We write $p^a_{bc}$ for the probability
that a random crowd member rates a as more similar
to b, so $p^a_{bc} + p^a_{cb} = 1$. The n objects are assumed
to have d-dimensional Euclidean representation, and
hence the data can be viewed as a matrix $M \in \mathbb{R}^{n \times d}$,
where $M_a$ denotes the row corresponding to a, and
the similarity matrix $K \in \mathbb{R}^{n \times n}$ is defined by
$K_{ab} = M_a \cdot M_b$, or equivalently $K = M M^\top$. The goal is to
learn M or, equivalently, learn K. (It is easy to go
back and forth between positive semidefinite (PSD)
K and M, though M is only unique up to change of
basis.) Also equivalent is the representation in terms
of distances, $d^2(a, b) = K_{aa} - 2K_{ab} + K_{bb}$.
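
As a small illustration of these equivalences, the sketch below builds K from an embedding M and checks that the squared distance computed from K matches the one computed directly from the rows of M. The helper names and the toy embedding are assumptions made for the example only.

```python
def gram(M):
    """Return K = M M^T for an embedding given as a list of rows."""
    return [[sum(x * y for x, y in zip(ra, rb)) for rb in M] for ra in M]

def sq_dist_from_kernel(K, a, b):
    """d^2(a, b) = K_aa - 2 K_ab + K_bb."""
    return K[a][a] - 2.0 * K[a][b] + K[b][b]

def sq_dist_from_embedding(M, a, b):
    """Squared Euclidean distance between rows a and b of M."""
    return sum((x - y) ** 2 for x, y in zip(M[a], M[b]))

# Toy embedding of n = 3 objects in d = 2 dimensions (hypothetical values).
M = [[1.0, 0.0], [0.0, 1.0], [3.0, 4.0]]
K = gram(M)
print(sq_dist_from_kernel(K, 0, 2))     # 20.0
print(sq_dist_from_embedding(M, 0, 2))  # 20.0
```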

In our setting, an MDS algorithm takes as input m
comparisons $(a_1 b_1 c_1, y_1), \ldots, (a_m b_m c_m, y_m)$ on n items,
where $y_i \in \{0, 1\}$ indicates whether $a_i$ is more like $b_i$
than $c_i$. Unless explicitly stated, we will often omit
$y_i$ and assume that the $b_i$ and $c_i$ have been permuted,
if necessary, so that $a_i$ was rated as more similar to
$b_i$ than $c_i$. The MDS algorithm outputs an embedding
$M \in \mathbb{R}^{n \times d}$ for some $d \ge 1$. A probabilistic
MDS model predicts $\hat{p}^a_{bc}$ based on $M_a$, $M_b$, and $M_c$.
The empirical log-loss of a model that predicts $\hat{p}^{a_i}_{b_i c_i}$ is
$\frac{1}{m} \sum_i \log 1/\hat{p}^{a_i}_{b_i c_i}$. Our probabilistic MDS model attempts
to minimize empirical log loss subject to some
regularization constraint. We choose a probabilistic 
model due to its suitability for use in combination 
with our information-gain criteria for selecting adap- 
tive triples, and also due to the fact that the same 
triple may elicit different answers from different people
(or the same person on different occasions).

An active MDS algorithm chooses each triple,
$a_i b_i c_i$, adaptively based upon previous labels
$(a_1 b_1 c_1, y_1), \ldots, (a_{i-1} b_{i-1} c_{i-1}, y_{i-1})$. We denote by
$M^\top$ the transpose of matrix M. For compact convex
set W, let $\Pi_W(K) = \arg\min_{T \in W} \sum_{ij} (K_{ij} - T_{ij})^2$ be
the closest matrix in W to K. Also define the set of
symmetric unit-length PSD matrices,

$$B = \{K \succeq 0 \mid K_{11} = K_{22} = \cdots = K_{nn} = 1\}.$$

Projection to the closest element of B is a quadratic
program which can be solved via a number of existing
techniques (Srebro & Shraibman, 2005; Lee et al.,
2010).

5. Our algorithm 

Our algorithm proceeds in phases. In the first phase, 
it queries a certain number of random triples comparing
each object $a \in [n]$ to random pairs of distinct
b, c. (Note that we never present a triple where a = b
or a = c except for quality control purposes.) Subsequently,
it fits the results to a matrix $M \in \mathbb{R}^{n \times d}$
(equivalently, fits $K \succeq 0$) using the probabilistic relative
similarity model described below. Then it uses our
adaptive selection algorithm to select further
triples. This iterates: in each phase all previous data 
is refit to the relative model, and then the adaptive 
selection algorithm generates more triples. 

• For each item $a \in [n]$, crowdsource labels for R
random triples with head a.

• For t = 1, 2, ..., T:

  - Fit $K^t$ to the labeled data gathered thus far,
    using the method described in Section 5.1
    (with d dimensions).

  - For each $a \in [n]$, crowdsource a label for the
    maximally informative triple with head a, using
    the method described in Section 6.

Typical parameter values which worked quickly and
well across a number of medium-sized data sets
(hundreds of objects) were R = 10, T = 25, and d = 3.
These settings were also used to generate Figure 2. We 
first describe the probabilistic MDS model and then 
the adaptive selection procedure. Further details are 
given in Section 7. 
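
The phase structure described above can be sketched as a short driver loop. This is only an outline under stated assumptions: `ask_crowd`, `fit`, and `select_adaptive` are hypothetical callables standing in for the crowdsourcing step, the model fit of Section 5.1, and the selection rule of Section 6, and the final refit after the last round is our addition for convenience.

```python
import random

def collect_triples(n, R, T, ask_crowd, fit, select_adaptive, rng):
    """Gather R random and T adaptive labeled triples per head object."""
    labeled = []
    # Phase 1: R random triples with each object a as head.
    for a in range(n):
        others = [i for i in range(n) if i != a]
        for _ in range(R):
            b, c = rng.sample(others, 2)  # distinct b, c, both different from a
            labeled.append((a, b, c, ask_crowd(a, b, c)))
    # Later phases: refit on all data so far, then one adaptive triple per head.
    for _ in range(T):
        model = fit(labeled)
        for a in range(n):
            b, c = select_adaptive(model, a, labeled)
            labeled.append((a, b, c, ask_crowd(a, b, c)))
    return fit(labeled), labeled
```

With R = 10 and T = 25 this loop gathers R + T = 35 labeled triples per object.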

5.1. Relative similarity model 

The relative similarity model is motivated by the scale-invariance
observed in many perceptual systems (see,
e.g., Chater & Brown, 1999). Let
$\delta_{ab} = \|M_a - M_b\|^2 = K_{aa} + K_{bb} - 2K_{ab}$.
A simple scale-invariant proposal takes
$\hat{p}^a_{bc} = \frac{\delta_{ac}}{\delta_{ab} + \delta_{ac}}$.
Such a model must also be regularized
or else it would have $\Theta(n^2)$ degrees of freedom.
One may regularize by the rank of K or by
setting $K_{ii} = 1$. Due to the scale-invariance of the
model, however, this latter constraint does not have
reduced complexity. In particular, note that halving
or doubling the matrix M doesn't change any probabilities.
Hence, descent algorithms may lead to very
small, large, or numerically unstable solutions. To address
this, we modify the model as follows, for distinct
a, b, c:

$$\hat{p}^a_{bc} = \frac{\mu + \delta_{ac}}{2\mu + \delta_{ab} + \delta_{ac}} \quad \text{and} \quad K_{ii} = 1, \tag{1}$$

for some parameter $\mu > 0$. Alternatively, this change
may be viewed as an additional assumption imposed
on the previous model - we suppose each object possesses
a minimal amount of "uniqueness," $\mu > 0$, such
that $K = \mu I + T$, where $T \succeq 0$. We fit the model by
local optimization performed directly on M (with random
initialization), and high-quality adaptive triples
are produced even for low dimensions.^5 Here $\mu$ serves
a purpose similar to a margin constraint.
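
The following sketch evaluates the relative model, taking the predicted probability to be (mu + delta_ac) / (2 mu + delta_ab + delta_ac) with delta the squared Euclidean distance between embedded points. It also illustrates the scaling issue: with mu = 0 the prediction is unchanged when M is doubled, while with mu > 0 it is not. The function names and the three unit-length points are assumptions chosen for illustration.

```python
def sq_dist(u, v):
    return sum((x - y) ** 2 for x, y in zip(u, v))

def relative_prob(M, a, b, c, mu):
    """Probability that a is rated more similar to b than to c."""
    d_ab = sq_dist(M[a], M[b])
    d_ac = sq_dist(M[a], M[c])
    return (mu + d_ac) / (2.0 * mu + d_ab + d_ac)

# Three unit-length points, so K_ii = 1 (hypothetical embedding).
M = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
M2 = [[2.0 * x for x in row] for row in M]  # the same embedding, doubled

for mu in (0.0, 0.5):
    print(mu, relative_prob(M, 0, 1, 2, mu), relative_prob(M2, 0, 1, 2, mu))
```

For any mu, `relative_prob(M, a, b, c, mu) + relative_prob(M, a, c, b, mu)` equals 1, as required of the two outcomes of a triple.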

There are two interesting points to make about our 
choice of model. First, the loss is not convex in K,
so there is a concern that local optimization may be 
susceptible to local minima. In Section 6.1, we state 
a theorem which explains why this does not seem to 
be a significant problem. Second, in Section 6.2, we 
discuss a simple convex alternative based on logistic 
regression, and we explain why this model, in combi- 
nation with our adaptive selection criterion, gives rise 
to poor adaptively-selected triples. 

6. Adaptive selection algorithm 

The idea is to capture the uncertainty about the lo- 
cation of an object through a probability distribution 
over points in $\mathbb{R}^d$, and then to ask the question that
maximizes information gain. Given a set of previous
comparisons of n objects, we generate, for each object
a = 1, 2, ..., n, a new triple to compare a to, as follows.
First, we embed the objects into $\mathbb{R}^d$ as described
above, using the available comparisons. Initially, we 
use a seed of randomly selected triples for this pur- 
pose. Later, we use all available comparisons - the 
initial random ones and those acquired adaptively. 

^5 For high-dimensional problems, we perform a gradient
projection descent on K. In particular, starting with $K^0 = I$,
we compute $K^{t+1} = \Pi_B(K^t - \eta \nabla \mathcal{L}(K^t))$ for step-size $\eta$
(see Preliminaries for the definition of $\Pi_B$).
Now, say the crowd has previously rated a as more
similar to $b_i$ than $c_i$, for i = 1, 2, ..., j - 1, and we want
to generate the jth query, $(a, b_j, c_j)$ (this is a slight
abuse of notation because we don't know which of $b_j$
or $c_j$ will be rated as closer to a). These observations
imply a posterior distribution of $\tau(x) \propto \pi(x) \prod_i \hat{p}^x_{b_i c_i}$
over $x \in \mathbb{R}^d$, where x is the embedding of a, and $\pi(x)$
is a prior distribution, to be described shortly.

Given any candidate query for objects in the database
b and c, the model predicts that the crowd will
rate a as more similar to b than c with probability

$$p \propto \int_x \frac{\delta(x, c)}{\delta(x, b) + \delta(x, c)} \, \tau(x) \, dx.$$

If it rates a more similar to b than c then x has a posterior distribution of

$$\tau_b(x) \propto \tau(x) \, \frac{\delta(x, c)}{\delta(x, b) + \delta(x, c)},$$

and $\tau_c(x)$ (of similar form)
otherwise. The information gain of this query is defined
to be $H(\tau) - pH(\tau_b) - (1 - p)H(\tau_c)$, where $H(\cdot)$
is the entropy of a distribution. This is equal to the
mutual information between the crowd's selection and
x. The algorithm greedily selects a query, among all
pairs $b, c \ne a$, which maximizes information gain. This
can be somewhat computationally intensive (seconds
per object in our datasets), so for efficiency we take
the best pair from a sample of random pairs.
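
A minimal sketch of this selection rule follows, assuming a discrete posterior supported on the embedded locations of the other objects (with a uniform prior over those locations) and a strictly positive mu in the likelihood so that no probability is exactly 0 or 1. All function names and parameters are illustrative assumptions rather than the system's actual code.

```python
import math
import random

def sq_dist(u, v):
    return sum((x - y) ** 2 for x, y in zip(u, v))

def like(x, b, c, mu):
    """Model probability that an object located at x is rated closer to b than to c."""
    d_b, d_c = sq_dist(x, b), sq_dist(x, c)
    return (mu + d_c) / (2.0 * mu + d_b + d_c)

def normalize(ws):
    total = sum(ws)
    return [w / total for w in ws]

def entropy(dist):
    return -sum(p * math.log(p) for p in dist if p > 0.0)

def posterior(M, a, answered, mu):
    """tau over candidate locations of a; answered holds (b, c) with b preferred."""
    locs = [i for i in range(len(M)) if i != a]
    ws = [math.prod(like(M[x], M[b], M[c], mu) for b, c in answered) for x in locs]
    return locs, normalize(ws)

def information_gain(M, locs, tau, b, c, mu):
    ls = [like(M[x], M[b], M[c], mu) for x in locs]
    p = sum(t * l for t, l in zip(tau, ls))                      # predicted answer
    tau_b = normalize([t * l for t, l in zip(tau, ls)])          # posterior if b wins
    tau_c = normalize([t * (1.0 - l) for t, l in zip(tau, ls)])  # posterior if c wins
    return entropy(tau) - p * entropy(tau_b) - (1.0 - p) * entropy(tau_c)

def best_query(M, a, answered, mu, n_samples, rng):
    """Best (b, c) among a random sample of pairs, neither equal to a."""
    locs, tau = posterior(M, a, answered, mu)
    pairs = [tuple(rng.sample(locs, 2)) for _ in range(n_samples)]
    return max(pairs, key=lambda bc: information_gain(M, locs, tau, bc[0], bc[1], mu))
```

Because the gain is a mutual information with a binary answer, it always lies between 0 and log 2 (in nats), which is a convenient sanity check for an implementation.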

It remains to explain how we generate the prior $\pi$.
We take $\pi$ to be the uniform distribution over the set
of points in M. Hence, the process can be viewed as 
follows. For the purpose of generating a new triple, 
we pretend the coordinates of all other objects are 
perfectly known, and we pretend that the object in 
question, a, is a uniformly random one of these other 
objects. The chosen pair is designed to maximize the 
information we receive about which object it is, given 
the observations we already have about a. The hope is 
that, for sufficiently large data sets, such a data-driven 
prior is a reasonable approximation to the actual dis- 
tribution over data. Another natural alternative prior 
would be a multinormal distribution fit to M.

6.1. Optimization guarantee 

The relative similarity model is appealing in that it 
fits the data well, suggests good triples, and also repre- 
sents interesting features on the data. Unfortunately, 
the model itself is not convex. We now give some 
justification for why gradient descent should not get 
trapped in local minima. As is sometimes the case 
in learning, it is easier to analyze an online version 
of the algorithm, i.e., a stochastic gradient descent. 
Here, we suppose that the sequence of triples is presented
in order: the learner predicts $K^{t+1}$ based on
$(a_1, b_1, c_1, y_1), \ldots, (a_t, b_t, c_t, y_t)$. The loss on iteration
t is $\ell_t(K^t) = \log 1/p$ where p is the probability that
the relative model with $K^t$ assigned to the correct outcome.

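We state the following theorem about stochastic gradient
descent, where $K^0 \in B$ is arbitrary and
$K^{t+1} = \Pi_B(K^t - \eta \nabla \ell_t(K^t))$.

Theorem 1. Let $a_t, b_t, c_t \in [n]$ be arbitrary, for t =
1, 2, .... Suppose there is a matrix $K^* \in B$ such that

$$\Pr[y_t = 1] = \frac{\mu + 2 - 2K^*_{a_t c_t}}{2\mu + 4 - 2K^*_{a_t b_t} - 2K^*_{a_t c_t}}.$$

For any $\epsilon > 0$, there exists a $T_0$ such that for any
$T > T_0$ and $\eta = 1/\sqrt{T}$,

$$E\left[\frac{1}{T} \sum_{t=1}^{T} \left(\ell_t(K^t) - \ell_t(K^*)\right)\right] \le \epsilon.$$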

We prove this theorem in Appendix A. 

6.2. The logistic model: A convex alternative 

As a small digression, we explain why the choice of 
probabilistic model is especially important for adap- 
tive learning. To this end, consider the following logis- 
tic model. This model is a natural hybrid of logistic 
regression and MDS. 



$$\hat{p}^a_{bc} = \frac{1}{1 + e^{K_{ac} - K_{ab}}} \tag{2}$$

Note that $\log(1 + e^{K_{ac} - K_{ab}})$ is a convex function of
$K \in \mathbb{R}^{n \times n}$. Hence, the problem of minimizing its
empirical log loss over a convex set is a convex op- 
timization problem. 
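
For comparison with the relative model, here is a small sketch of the logistic model, assuming the form p = 1 / (1 + exp(K_ac - K_ab)) with K_ab the inner product of the embedded points. The helper names and the toy embedding are hypothetical. Unlike the relative model, the prediction depends on the scale of M.

```python
import math

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def logistic_prob(M, a, b, c):
    """Probability that a is rated more similar to b than to c."""
    k_ab, k_ac = dot(M[a], M[b]), dot(M[a], M[c])
    return 1.0 / (1.0 + math.exp(k_ac - k_ab))

def empirical_log_loss(M, triples):
    """Mean of log(1 + exp(K_ac - K_ab)) over triples (a, b, c) with b preferred."""
    total = sum(math.log1p(math.exp(dot(M[a], M[c]) - dot(M[a], M[b])))
                for a, b, c in triples)
    return total / len(triples)

M = [[1.0, 0.0], [0.8, 0.6], [-1.0, 0.0]]   # hypothetical unit-length points
M2 = [[2.0 * x for x in row] for row in M]  # doubled embedding

p, p2 = logistic_prob(M, 0, 1, 2), logistic_prob(M2, 0, 1, 2)
print(p, p2)  # doubling M pushes the prediction toward 1
```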

Experiments indicate that the logistic model fits data 
well and reproduces interesting features, such as 
vowel/consonant or stripedness. However, empirically 
it performs poorly in terms of deciding which triples to 
ask. Note that the logistic model (and any generalized 
linear model) will select extreme comparisons, where 
the inner products are as large as possible. To give an 
intuitive understanding, suppose that hair length was
a single feature and one wanted to determine whether
a person has hair length x or x + 1. The logistic model
would compare that person to a bald person (x = 0)
and a person with hair length 2x + 1, while the relative
model would ideally compare him to people with hair
lengths x and x + 1.

7. System parameters & quality control

Experiments were performed using Amazon's Mechan- 
ical Turk web service, where we defined 'Human Intel- 
ligence Tasks' to be performed by one or more users. 
Each task consists of 50 comparisons and the inter- 
face is optimized to be performed with 50 mouse clicks 
(and no scrolling). The mean completion time was 
approximately 2 minutes. This payment was determined
based upon worker feedback. Initial experiments
revealed a high percentage of seemingly random
responses, but after closer inspection the vast major- 
ity of these poor results came from a small number 
of individuals. To improve quality control, we im- 
posed a limit on the maximum number of tasks a sin- 
gle user could perform on any one day, we selected 
users who had completed at least 48 tasks with a 95% 
approval rate, and each task included 20% triples for 
which there was tremendous agreement between users. 
These "gold standard" triples were also automatically 
generated and proved to be an effective manner to rec- 
ognize and significantly reduce cheating. The system 
is implemented using Python, Matlab, and C, and runs 
completely automatically in Windows or Unix. 

7.1. Question phrasing and crowd alignment 

One interesting issue is how to frame similarity ques- 
tions. On the one hand, it seems purest in form to 
give the users carte blanche and ask only, "is a more 
similar to b than c." On the other hand, in feedback 
users complained about these tasks and often asked 
what we meant by similarity. Moreover, different users 
will inevitably weigh different features differently when 
performing comparisons. 

For example, consider a comparison of face images,
where a is a white male, b is a black male, and c is
a white female. Some users will consider gender more
important than skin color in determining similarity, and others may
feel the opposite is true. Others may feel that the 
question is impossible to answer. Consider phrasing 
the question as follows, "At a distance, who would you 
be more likely to mistake for a: b or c?" For any two 
people, there is presumably some distance at which 
one might be mistaken for the other, so the question 
may seem more possible to answer for some people. 
Second, users may more often agree that skin color 
is more important than gender, because both are easily
identified close up but skin color may be identifiable
even at a great distance. While we haven't done experiments
to determine the importance of question phrasing,
anecdotal evidence suggests that users enjoy the
tasks more when more specific definitions of similarity 
are given. 

Two natural goals of question phrasing might be: (1) 
to align users in their ranking of the importance of 
different features and (2) to align user similarity no- 
tions with the goals of the task at hand. For example, 
if the task is to find a certain person, the question, 
"which two people are most likely to be (genealogi- 
cally) related to one another," may be poor because 
Figure 2. The 20Q plots comparing training based on
adaptively selected triples to randomly selected training 
triples. The left plot shows the mean predicted log-ranks 
of randomly chosen objects after 20 randomly chosen ques- 
tions. The right plot shows the mean predicted log-ranks 
of randomly chosen objects after 20 adaptive queries. Plots 
were generated using the mixed dataset consisting of n = 
225 objects, with 10 initial random triples per object. In 
both plots, the performance using 22 = (10 random) + (12 
adaptively chosen) triples was matched using all 35 random 
triples. Hence, approximately 60% more random triples 
were required to match this particular performance level 
of the adaptive algorithm. 



users may overlook features such as gender and age. 
In our experiments on neckties, for example, the task 
was titled "Which ties are most similar?" and the 
complete instructions were: "Someone went shopping 
at a tie store and wanted to buy the item on top, but it 
was not available. Click on item (a) or (b) below that 
would be the best substitute." 

8. Experiments and Applications 

We experiment on four datasets: (1) twenty-six images
of the lowercase Roman alphabet (Calibri font),
(2) 223 country flag images from flagpedia.net, (3)
433 floor tile images from Amazon.com, and (4) 300 
product images from an online tie store also hosted 
at Amazon.com. We also consider a hand-selected 
"mixed" dataset consisting of 225 images: 75 ties, 75 
tiles, and 75 flags. Surprisingly, it seems that for these 
datasets about 30-40 triples per object suffice to learn 
the Crowd Kernel well, according to the 20Q metric 
that we describe below. Figure 2 shows the results on
the mixed dataset, comparing the 20Q metric trained 
on random vs. adaptive triples. For both adaptive and 
random questions, for certain performance levels, one 
requires about 60% more random queries than adap- 
tive queries. Given very little data or a lot of data, one 
Figure 3. Six objects in the mixed dataset along with the
adaptive pairs to which that object was compared, below, 
and user selections in red. The first pair below each large 
object was chosen adaptively based upon the results of ten 
random comparisons. Then, proceeding down, the pairs 
were chosen using the ten random comparisons plus the 
results of the earlier comparisons above. 
Figure 5. Examples of nearest neighbors for floor tiles.




Figure 6. The flag images displayed according to their pro- 
jection on the top two principal components of a PCA. The 
first principal component is the horizontal axis.



Figure 4. Nearest-neighbors for some neckties from the tie- 
store dataset. Nearest neighbors are displayed from left to 
right. Note that neck ties were never confused for bow ties, 
tie clips or scarves. 

does not expect the adaptive algorithm to perform sig- 
nificantly better. 

Figure 3 shows the adaptive triples selected on an il- 
lustrative dataset composed of a mixture of flags, ties 
and tiles. 

For ease of implementation, we assume all users are 
identical. This is a natural starting point, especially 
given that our main focus is on active learning. 

8.1. 20 Questions Metric 

Since one application of such systems is search, i.e.,
searching for an item whose appearance the user knows
(we assume that the user can answer queries as
if she knows what the store image looks like),
it is natural to ask how well we have "honed in" on




Figure 7. For fun: the faces of 186 colleagues displayed 
according to their projection on the top two principal components.
The first principal component is the horizontal axis.
Figure 8. Empirical results of an SVM using the crowd kernel,
based on leave-one-out cross validation. Note that in
many cases we hand-selected "easy" subsets of objects to 
label. For example, when judging whether a flag is striped 
or not, we removed flags which might be interpreted ei- 
ther way. The selection was based on how unambiguous 
the objects were, with respect to the desired label, and not 
related to the target kernel. 



the desired object after a certain number of questions. 
For the 20 Questions (20Q) metric, two independent 
parts of the system are employed, an evaluator and a 
guesser. First, the evaluator randomly and secretly se- 
lects an object in the database, x. The guesser is aware 
of the database but not of which item x has been se- 
lected. The guesser is allowed to query 20 triples (as in 
the game "20 Questions") with head x, after which it 
produces a ranking of items in the database, from most 
to least likely. Then the evaluator reveals the identity 
of X and hence its position in the ordered ranking, as 
well. The metric is the average log of the position of 
the random target item in this list. The log reflects the 
idea that the position of lower-ranked objects is less
important: it weights moving an object from position 2
to 4 as equally important as moving an object from position
20 to 40. This metric is meant to roughly capture 
performance, but of course in a real system users may 
not have the patience to click on twenty pairs of im- 
ages and may prefer to have fewer clicks but use larger 
comparison sets. (Our GUI has the user select one 
of 8 or 9 images, which could potentially convey the 
same information as 3 binary choices.) Now, the ques- 
tions that the guesser asks could be random questions, 
which we refer to as the 20 Random Questions metric, 
or adaptively chosen, for the 20 Adaptive Questions 
metric. In the latter case, the guesser uses the same 
maximum information-gain criterion as in the adap- 
tive triple generation algorithm, relative to whichever 
model was learned (based on random or adaptively se- 
lected training triples). 
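
The metric itself is simple to compute once the guesser's rankings are available. The sketch below assumes 1-based positions and natural logarithms (the base only rescales the metric); the function name and the example rankings are hypothetical.

```python
import math

def mean_log_rank(rankings, targets):
    """Average log position of each secret target in the guesser's ranking.

    rankings[i] lists item IDs from most to least likely; targets[i] is the
    item the evaluator secretly selected for game i.
    """
    total = 0.0
    for ranking, target in zip(rankings, targets):
        position = ranking.index(target) + 1  # 1-based position in the list
        total += math.log(position)
    return total / len(targets)

# Two hypothetical games: the target is ranked 1st, then 4th.
print(mean_log_rank([[7, 3, 5, 9], [3, 5, 7, 9]], [7, 9]))
```

On the log scale, moving a target from position 2 to 4 costs log 2, exactly as much as moving it from position 20 to 40.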

8.2. Using the Kernel for Classification 

The learned Kernels may be used in a linear classifier 
such as a support vector machine. This helps elucidate 
which features have been used by humans in labeling 



the data. In the experiments below, an unambigu- 
ous subset of the images were labeled with binary ± 
classes. For example, we omitted the letter y in labeling
vowels and consonants (y was in fact classified as a
consonant, and c was misclassified as a vowel), and we
selected only completely striped or unstriped flags for 
flag stripe classification. The SVM-Light (Joachims, 
1998) package was used with default parameters and 
its leave-one-out (LOO) classification results are re- 
ported in Figure 8. 

8.3. Visual Search 

We provide a GUI visual search tool, exemplified in 
Figure 1. Given n images, their embedding into $\mathbb{R}^d$ and
the related probabilistic model for triples, we would
like to help a user find either a particular object she
has in mind, or a similar one. We do this by playing 
"20 Questions" with 8- or 9-tuple queries, generated by 
an information-gain adaptive selection algorithm very 
similar to the one described in Section 6. 
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A. Proof of Theorem 1 
A.l. Analysis 

Before we present the proof of Theorem 1, we intro- 
duce a natural generalization which will be a conve- 
nient abstraction. We call this relative regression. 

A.2. Relative regression

Consider the following online relative regression
model. There is a sequence of examples
$(x_1, x'_1, y_1), (x_2, x'_2, y_2), \ldots, (x_T, x'_T, y_T) \in X \times X \times \{0, 1\}$,
for some set $X \subseteq \mathbb{R}^d$. For $w \in \mathbb{R}^d$,
define the relative linear model with $w$ to be,

$$p_t(w) = \frac{w \cdot x_t}{w \cdot x_t + w \cdot x'_t}.$$

The sequence $x_1, x'_1, \ldots, x_T, x'_T$ is chosen arbitrarily (or
even adversarially) in advance; afterwards it is assumed
that there is some $w^* \in \mathbb{R}^d$ such that
$\Pr[y_t = 1] = p_t(w^*)$ and that the different $y_t$'s are
independent. It is further assumed that $w^*$ belongs to some
convex compact set $W \subseteq \mathbb{R}^d$ and that $w \cdot x$ is positive
and bounded over $w \in W, x \in X$. Without loss of
generality, by scaling we can require $w \cdot x \in [1, \beta]$ for
some $\beta > 0$ and every $w \in W, x \in X$.

On the $t$th period, the algorithm outputs $w_t \in W$, then
observes $x_t, x'_t, y_t$, and finally incurs loss $\ell_t(w_t)$ where
$\ell_t(w) = \log 1/p_t(w)$ if $y_t = 1$ and $\ell_t(w) = \log 1/(1 - p_t(w))$
if $y_t = 0$. The goal of the algorithm is to incur
total loss not much larger than $\sum_t \ell_t(w^*)$, the best
choice had we known $w^*$ in advance.
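The relative linear model and its log loss can be written out directly. The sketch below is a minimal illustration; the function names and the toy vectors are assumptions chosen for the example, not part of the original analysis.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def relative_prob(w, x, x_prime):
    """p_t(w) = (w.x) / (w.x + w.x'); both inner products must be positive."""
    a, b = dot(w, x), dot(w, x_prime)
    if a <= 0 or b <= 0:
        raise ValueError("w.x and w.x' must be positive")
    return a / (a + b)

def log_loss(w, x, x_prime, y):
    """log 1/p_t(w) if y = 1, and log 1/(1 - p_t(w)) if y = 0."""
    p = relative_prob(w, x, x_prime)
    return math.log(1.0 / p) if y == 1 else math.log(1.0 / (1.0 - p))

# Toy example: w.x = 0.6 and w.x' = 0.8, so p = 0.6 / 1.4.
w = [0.6, 0.8]
x, x_prime = [1.0, 0.0], [0.0, 1.0]
p = relative_prob(w, x, x_prime)
loss_if_one = log_loss(w, x, x_prime, 1)
loss_if_zero = log_loss(w, x, x_prime, 0)
```

Because $p < 1/2$ in this toy case, observing $y_t = 1$ incurs the larger of the two losses.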

We note that an analogous (and slightly simpler)
version of the following lemma can be proven for squared
loss.

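Lemma 1. Let $X, W \subseteq \mathbb{R}^d$ and suppose that $W$ is
compact and convex and $\exists \alpha > 0$ such that for all
$x \in X, w \in W$: $\|x\|, \|w\| \le 1$, and $w \cdot x \ge \alpha$. Then for
any $\eta > 0$ and any $w_1 \in W$ and $w_{t+1} = \Pi_W(w_t - \eta\nabla_t)$,

$$\frac{1}{T}\,\mathbb{E}\left[\sum_{t=1}^{T} \ell_t(w_t) - \ell_t(w^*)\right] \le \frac{2\eta}{\alpha^3} + \frac{2}{T\eta\alpha}.$$

In particular, for $\eta = \alpha/\sqrt{T}$ this gives a bound on
the right-hand side of $\frac{4}{\alpha^2\sqrt{T}}$.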
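The update analyzed in Lemma 1 is projected online gradient descent on the log loss. The following is a minimal sketch under stated assumptions: $W$ is taken to be the box $[0.2, 0.5]^2$ (so that projection is coordinate-wise clipping and $w \cdot x \ge \alpha = 0.2$ for the nonnegative unit-norm vectors in the pool), and the data, the pool, and all function names are hypothetical.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def grad_log_loss(w, x, xp, y):
    """Gradient at w of log(w.x + w.x') - log(w.chosen), chosen = x if y = 1 else x'."""
    total = dot(w, x) + dot(w, xp)
    chosen = x if y == 1 else xp
    denom = dot(w, chosen)
    return [(a + b) / total - c / denom for a, b, c in zip(x, xp, chosen)]

def project_box(w, lo, hi):
    """Euclidean projection onto the box [lo, hi]^d (an assumed choice of W)."""
    return [min(max(c, lo), hi) for c in w]

def run_ogd(examples, w_init, eta, lo=0.2, hi=0.5):
    """Output w_t, observe (x_t, x'_t, y_t), then take a projected gradient step."""
    w = list(w_init)
    iterates = []
    for x, xp, y in examples:
        iterates.append(w)
        g = grad_log_loss(w, x, xp, y)
        w = project_box([c - eta * gc for c, gc in zip(w, g)], lo, hi)
    return iterates, w

# Synthetic data drawn from the relative model with a hypothetical w*.
rng = random.Random(0)
w_star = [0.5, 0.2]
pool = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
T, alpha = 200, 0.2
examples = []
for _ in range(T):
    x, xp = rng.sample(pool, 2)
    p = dot(w_star, x) / (dot(w_star, x) + dot(w_star, xp))
    examples.append((x, xp, 1 if rng.random() < p else 0))

iterates, w_final = run_ogd(examples, [0.35, 0.35], eta=alpha / math.sqrt(T))
```

The step size follows the $\eta = \alpha/\sqrt{T}$ choice from the lemma; the box is only one convenient convex set satisfying the lemma's norm and positivity conditions.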

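Proof. Following the analysis of Zinkevich (Zinkevich,
2003) we consider the potential equal to the squared
distance $(w_t - w^*)^2$ and argue that it decreases whenever
we have substantial error. Let $\nabla_t = \nabla\ell_t(w_t) \in \mathbb{R}^d$,
which is,

$$\nabla_t = \frac{x_t + x'_t}{w_t \cdot x_t + w_t \cdot x'_t} - \begin{cases} \dfrac{x_t}{w_t \cdot x_t} & \text{if } y_t = 1, \\[2ex] \dfrac{x'_t}{w_t \cdot x'_t} & \text{if } y_t = 0. \end{cases}$$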



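By the triangle inequality $\|\nabla_t\| \le G$ for $G = \frac{2}{\alpha}$. Now,
as Zinkevich points out, due to convexity of $W$,
$(w - \Pi_W(v))^2 \le (w - v)^2$ for any $v \in \mathbb{R}^d$ and $w \in W$.
Hence,

$$(w^* - w_{t+1})^2 \le (w^* - w_t + \eta\nabla_t)^2.$$

Thus the decrease in potential, call it
$\Delta_t = (w^* - w_t)^2 - (w^* - w_{t+1})^2$, is at least:

$$\Delta_t \ge (w^* - w_t)^2 - (w^* - w_t + \eta\nabla_t)^2 = 2\eta\nabla_t \cdot (w_t - w^*) - \eta^2\nabla_t^2.$$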

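Next, we consider the quantity $\mathbb{E}[\nabla_t \cdot w^*]$, where
the expectation is taken over the random $y_t$ (fixing
$y_1, \ldots, y_{t-1}$). By expansion, the expectation is:

$$\frac{w^* \cdot x_t + w^* \cdot x'_t}{w_t \cdot x_t + w_t \cdot x'_t} - p_t(w^*)\,\frac{w^* \cdot x_t}{w_t \cdot x_t} - (1 - p_t(w^*))\,\frac{w^* \cdot x'_t}{w_t \cdot x'_t}.$$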

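After simple algebraic manipulation, which is difficult
to show in two-column format, we have,

$$\mathbb{E}[\nabla_t \cdot w^*] = -Z_t\,(p_t(w^*) - p_t(w_t))^2, \quad \text{where} \quad Z_t = \frac{(w_t \cdot x_t + w_t \cdot x'_t)(w^* \cdot x_t + w^* \cdot x'_t)}{(w_t \cdot x_t)(w_t \cdot x'_t)}.$$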
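One way to carry out the algebraic manipulation is sketched below. The shorthand $a, b, a^*, b^*, p, p^*$ is introduced here only for this derivation and does not appear in the original.

```latex
% Shorthand (assumed for this sketch):
%   a = w_t \cdot x_t,  b = w_t \cdot x'_t,  a^* = w^* \cdot x_t,  b^* = w^* \cdot x'_t,
%   p = p_t(w_t) = a/(a+b),  p^* = p_t(w^*) = a^*/(a^*+b^*).
\begin{align*}
\mathbb{E}[\nabla_t \cdot w^*]
  &= \frac{a^* + b^*}{a + b} - p^*\,\frac{a^*}{a} - (1 - p^*)\,\frac{b^*}{b} \\
  % substitute a^* = p^*(a^*+b^*), b^* = (1-p^*)(a^*+b^*), a = p(a+b), b = (1-p)(a+b)
  &= \frac{a^* + b^*}{a + b}\left[1 - \frac{(p^*)^2}{p} - \frac{(1 - p^*)^2}{1 - p}\right] \\
  % use (p^*)^2/p + (1-p^*)^2/(1-p) = 1 + (p - p^*)^2 / (p(1-p))
  &= -\frac{a^* + b^*}{a + b}\cdot\frac{(p - p^*)^2}{p(1 - p)} \\
  % use p(1-p) = ab/(a+b)^2
  &= -\frac{(a + b)(a^* + b^*)}{ab}\,(p^* - p)^2 \;=\; -Z_t\,(p^* - p)^2 .
\end{align*}
```

The identity used in the third line is the same one that appears in the proof of Lemma 2.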



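Also note that $\nabla_t \cdot w_t = 0$ regardless of $y_t$. Hence,
$\mathbb{E}[\nabla_t \cdot w_t] = 0$. Combining these with the fact that we
have shown that $\Delta_t \ge 2\eta\nabla_t \cdot (w_t - w^*) - \eta^2G^2$, gives,

$$\begin{aligned}
\mathbb{E}[\Delta_t] &\ge 2\eta Z_t\,(p_t(w^*) - p_t(w_t))^2 - \eta^2G^2 \\
&\ge 2\eta\,\frac{w^* \cdot x_t + w^* \cdot x'_t}{w_t \cdot x_t + w_t \cdot x'_t}\cdot\frac{(p_t(w_t) - p_t(w^*))^2}{p_t(w_t)(1 - p_t(w_t))} - \eta^2G^2 \\
&\ge 2\eta\alpha\,\frac{(p_t(w_t) - p_t(w^*))^2}{p_t(w_t)(1 - p_t(w_t))} - \eta^2G^2.
\end{aligned}$$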

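In the last line we have used the fact that $w \cdot x \in [\alpha, 1]$
for $w \in W, x \in X$. Now, by Lemma 2 which follows
this proof, in expectation over $y_t$,

$$\mathbb{E}[\ell_t(w_t) - \ell_t(w^*)] \le \frac{(p_t(w_t) - p_t(w^*))^2}{p_t(w_t)(1 - p_t(w_t))}.$$

Combining the previous two displayed equations gives,

$$\mathbb{E}[\Delta_t] \ge 2\eta\alpha\,\mathbb{E}[\ell_t(w_t) - \ell_t(w^*)] - \eta^2G^2.$$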

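Finally, since the potential $(w_t - w^*)^2 \ge 0$, we have
$\sum_t \Delta_t \le (w_1 - w^*)^2 \le 4$. Hence,

$$\mathbb{E}\left[\sum_t \ell_t(w_t) - \ell_t(w^*)\right] \le \frac{T\eta^2G^2 + 4}{2\eta\alpha}.$$

Substituting $G = 2/\alpha$ gives the lemma. □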
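For reference, the final substitution can be written out as follows. This is a worked sketch of the arithmetic only, assuming the proof's bound $(T\eta^2G^2 + 4)/(2\eta\alpha)$ on the total expected excess loss.

```latex
\begin{align*}
\frac{T\eta^2 G^2 + 4}{2\eta\alpha}\bigg|_{G = 2/\alpha}
  &= \frac{4T\eta^2/\alpha^2 + 4}{2\eta\alpha}
   = \frac{2T\eta}{\alpha^3} + \frac{2}{\eta\alpha}, \\
\text{and with } \eta = \frac{\alpha}{\sqrt{T}}: \qquad
\frac{2T\eta}{\alpha^3} + \frac{2}{\eta\alpha}
  &= \frac{2\sqrt{T}}{\alpha^2} + \frac{2\sqrt{T}}{\alpha^2}
   = \frac{4\sqrt{T}}{\alpha^2}.
\end{align*}
```

Dividing by $T$ turns this into an average excess loss of $4/(\alpha^2\sqrt{T})$, which vanishes as $T$ grows.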



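Lemma 2. Let $p + q = 1$ and $p^* + q^* = 1$ for
$p, p^* \in [0, 1]$. Then,

$$p^* \log\frac{p^*}{p} + q^* \log\frac{q^*}{q} \le \frac{(p - p^*)^2}{pq}.$$

Proof. By concavity of log, Jensen's inequality implies,

$$p^* \log\frac{p^*}{p} + q^* \log\frac{q^*}{q} \le \log\left(\frac{(p^*)^2}{p} + \frac{(q^*)^2}{q}\right).$$

Simple algebraic manipulation shows that,

$$\frac{(p^*)^2}{p} + \frac{(q^*)^2}{q} = 1 + \frac{(p - p^*)^2}{pq}.$$

Finally, the fact that $\log(1 + x) \le x$ completes the
lemma. □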
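Lemma 2 can be sanity-checked numerically. The sketch below evaluates both sides on a grid; the grid resolution, the function names, and the convention $0 \log 0 = 0$ are assumptions made for the check, which illustrates the inequality but does not replace the proof.

```python
import math

def kl_bernoulli(p_star, p):
    """p* log(p*/p) + q* log(q*/q), with the convention 0 log 0 = 0."""
    total = 0.0
    for a, b in ((p_star, p), (1.0 - p_star, 1.0 - p)):
        if a > 0:
            total += a * math.log(a / b)
    return total

def chi_square_bound(p_star, p):
    """Right-hand side of Lemma 2: (p - p*)^2 / (p q)."""
    return (p - p_star) ** 2 / (p * (1.0 - p))

grid = [i / 20 for i in range(1, 20)]    # p strictly inside (0, 1)
stars = [i / 20 for i in range(0, 21)]   # p* anywhere in [0, 1]

# Smallest value of (bound - left-hand side) over the grid.
worst_gap = min(chi_square_bound(s, p) - kl_bernoulli(s, p)
                for p in grid for s in stars)
```

The check restricts $p$ to the open interval because the right-hand side is undefined when $pq = 0$.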

A.3. Proof 

To prove Theorem 1, map matrix $S \in \mathbb{R}^{n \times n}$ to a vector
$w(S) \in \mathbb{R}^{1+n^2}$ consisting of the constant $\mu + 2$ in the
first coordinate followed by the entries of $S$. Taken
over the set of symmetric $S \succeq 0$ such that $S_{ii} = 1$,
the vectors $w(S)$ form a compact convex set of radius
$\sqrt{n^2 + (\mu + 2)^2}$. Also,

$$\frac{\mu + 2 - 2S_{ac}}{2\mu + 4 - 2S_{ab} - 2S_{ac}}$$

is our relative regression model for $w = w(S)$, $x \in \mathbb{R}^{1+n^2}$
being the vector with a 1 in the first position
and $-1$'s in the positions corresponding to the $ac$ and
$ca$ entries of $S$ (and zero elsewhere), and $x'$ having a
1 in the first position and $-1$'s in the positions
corresponding to $ab$ and $ba$ (and zero elsewhere). The
inner product of $w(S)$ and $x$ is $\mu + 2 - 2S_{ac}$ and hence is
bounded. To apply Lemma 1, one must scale $w(S)$ and
$x, x'$ down. However, it is clear that for any $\epsilon$ and $T$
sufficiently large, setting $\eta = 1/\sqrt{T}$ in Lemma 1 gives
the necessary bound.
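The reduction can be made concrete with a short sketch that builds $w(S)$, $x$, and $x'$ and evaluates the resulting relative model. The row-by-row flattening of $S$, the 0-based indices, the function names, and the numerical values of $S$ and $\mu$ are all assumptions chosen for illustration.

```python
def dot(u, v):
    return sum(p * q for p, q in zip(u, v))

def w_of_S(S, mu):
    """Constant mu + 2 in the first coordinate, then the entries of S row by row."""
    return [mu + 2.0] + [entry for row in S for entry in row]

def query_vector(n, i, j):
    """1 in the first position, -1 at the (i, j) and (j, i) entries, 0 elsewhere."""
    x = [0.0] * (1 + n * n)
    x[0] = 1.0
    x[1 + i * n + j] = -1.0
    x[1 + j * n + i] = -1.0
    return x

def relative_model(S, mu, a, b, c):
    """(w.x) / (w.x + w.x') with x built from (a, c) and x' built from (a, b)."""
    n = len(S)
    w = w_of_S(S, mu)
    x = query_vector(n, a, c)
    x_prime = query_vector(n, a, b)
    return dot(w, x) / (dot(w, x) + dot(w, x_prime))

# Hypothetical symmetric positive semidefinite S with unit diagonal.
S = [[1.0, 0.8, 0.1],
     [0.8, 1.0, 0.3],
     [0.1, 0.3, 1.0]]
mu = 0.5

inner = dot(w_of_S(S, mu), query_vector(3, 0, 2))   # mu + 2 - 2 * S[0][2]
prob = relative_model(S, mu, 0, 1, 2)
```

With these values the inner product is $\mu + 2 - 2S_{ac} = 2.3$, matching the expression in the text, and both inner products stay positive because $|S_{ij}| \le 1$ and $\mu > 0$.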